Examine the smallest and largest values in numerical data: Have the software show you the
smallest and largest values for each numerical variable. This check can often catch decimal-point
errors (such as a hemoglobin value of 125 g/dL instead of 12.5 g/dL) or transposition errors (for
example, a weight of 517 pounds instead of 157 pounds).
Sort the values of variables: If your program can show you a sorted list of all the values for a
variable, that’s even better, because a sorted list often reveals misclassified categories as well as
numerical outliers.
Search for blanks and commas: You can have Excel search for blanks in category values that
shouldn’t have blanks, or for commas in numeric variables. Make sure the “Match entire cell
contents” option is deselected in the Find and Replace dialog box (you may have to click the
Options button to see the check box). This operation can also be done using statistical software. Be
wary if there are a large number of missing values, because this could indicate a data collection
problem.
Tabulate categorical variables: You can have your statistics program tabulate each categorical
variable (showing you how often each category occurs in your data). This check usually finds
misclassified categories. Note that blanks and special characters in character variables may cause
incorrect results when you query the data, which is why this check is important.
Spot-check data entry: If you’re entering data from forms or printed material, choose a percentage
of the entries to double-check (for example, 10 percent of the forms you entered). This can help you
tell whether there are any systematic data entry errors or missing data.
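The checks in the list above can be sketched in a few lines of Python. This is only an illustration, not a prescribed method: the records, the column names (sex, weight_lb, hgb_g_dl), and the helper function names are all hypothetical, and your own statistics package very likely has built-in equivalents. Values are deliberately kept as text so that entry errors stay visible.

```python
import random
from collections import Counter

# Hypothetical records, invented for illustration; values stay as text
# so that stray commas, blanks, and odd codes are not silently hidden.
records = [
    {"id": "001", "sex": "F",  "weight_lb": "157",  "hgb_g_dl": "12.5"},
    {"id": "002", "sex": "M",  "weight_lb": "517",  "hgb_g_dl": "14.1"},
    {"id": "003", "sex": "f",  "weight_lb": "1,62", "hgb_g_dl": "125"},
    {"id": "004", "sex": "",   "weight_lb": "148",  "hgb_g_dl": ""},
    {"id": "005", "sex": "M ", "weight_lb": "203",  "hgb_g_dl": "13.8"},
]

def to_number(text):
    """Return the text as a float, or None if it is not a clean number."""
    try:
        return float(text)
    except ValueError:
        return None

def sorted_values(rows, column):
    """Sorted list of the values in a column that parse as numbers."""
    parsed = (to_number(row[column]) for row in rows)
    return sorted(value for value in parsed if value is not None)

def numeric_range(rows, column):
    """Smallest and largest parseable values in a numeric column."""
    values = sorted_values(rows, column)
    return values[0], values[-1]

def ids_with_blank(rows, column):
    """IDs of records whose value is empty or only whitespace."""
    return [row["id"] for row in rows if row[column].strip() == ""]

def ids_with_comma(rows, column):
    """IDs of records whose value contains a comma."""
    return [row["id"] for row in rows if "," in row[column]]

def tabulate(rows, column):
    """Count how often each distinct value occurs, exactly as entered."""
    return Counter(row[column] for row in rows)

def spot_check_sample(rows, fraction=0.10, seed=0):
    """Random subset of records (at least one) to compare with the forms."""
    sample_size = max(1, round(len(rows) * fraction))
    return random.Random(seed).sample(rows, sample_size)

print(numeric_range(records, "hgb_g_dl"))    # extremes expose the value 125
print(sorted_values(records, "weight_lb"))   # 517 stands out at the end
print(ids_with_blank(records, "sex"))        # records with a missing category
print(ids_with_comma(records, "weight_lb"))  # records with a stray comma
print(tabulate(records, "sex"))              # "f" and "M " appear as extra levels
print([row["id"] for row in spot_check_sample(records)])
```

In this sketch the tabulation counts "f" and "M " (with a trailing blank) separately from "F" and "M", which is exactly the kind of misclassification the tabulation check is meant to surface.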
Creating a File that Describes Your Data File
Every research database, large or small, simple or complicated, should include a data dictionary that
describes the variables contained in the database. It is a necessary part of study documentation that
needs to be accessible to the research team. A data dictionary is usually set up as a table (often in
Excel), where each row documents one variable in the database. For each variable,
the dictionary should contain the following information (sometimes referred to as metadata, which
means “data about data”):
A variable name (usually no more than ten characters) that’s used when telling the software what
variables you want it to use in an analysis
A longer verbal description of the variable in a human-readable format (in other words, a person
reading this description should be able to understand the content of the variable)
The type of data (text, categorical, numerical, date/time, and so on)
If numeric: Information about how that number is displayed (how many digits are before
and after the decimal point)
If date/time: How it’s formatted (for example, 12/25/13 10:50pm or 25Dec2013 22:50)
If categorical: What codes and descriptors exist for each level of the category (these are
often called picklists, and can be documented on a separate tab in an Excel data dictionary)
How missing values are represented in the database (99, 999, “NA,” and so on)
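As a rough sketch of how these items fit together, the data dictionary below is written as a small Python structure with one entry per variable. The variable names, descriptions, codes, and missing-value markers are invented for illustration, and the three helper functions are hypothetical conveniences rather than part of any standard tool. In practice the same table would usually live in a spreadsheet.

```python
# A hypothetical data dictionary: one entry (row) per variable.
data_dictionary = [
    {
        "name": "subj_id",
        "description": "Unique identifier assigned to each subject",
        "type": "text",
        "format": "",
        "codes": {},
        "missing": "",
    },
    {
        "name": "sex",
        "description": "Sex of the subject as recorded at enrollment",
        "type": "categorical",
        "format": "",
        "codes": {"F": "Female", "M": "Male"},  # the picklist
        "missing": "NA",
    },
    {
        "name": "weight_lb",
        "description": "Body weight in pounds at the baseline visit",
        "type": "numerical",
        "format": "3 digits before and 1 digit after the decimal point",
        "codes": {},
        "missing": "999",
    },
    {
        "name": "visit_dt",
        "description": "Date and time of the baseline visit",
        "type": "date/time",
        "format": "25Dec2013 22:50",
        "codes": {},
        "missing": "",
    },
]

def undocumented_columns(columns, dictionary):
    """Columns in the data file that have no data dictionary entry."""
    documented = {entry["name"] for entry in dictionary}
    return [column for column in columns if column not in documented]

def names_too_long(dictionary, limit=10):
    """Variable names longer than the chosen character limit."""
    return [e["name"] for e in dictionary if len(e["name"]) > limit]

def invalid_codes(values, entry):
    """Values that are neither a picklist code nor the missing marker."""
    allowed = set(entry["codes"]) | {entry["missing"]}
    return sorted({value for value in values if value not in allowed})

sex_entry = data_dictionary[1]
print(undocumented_columns(["subj_id", "sex", "height_in"], data_dictionary))
print(names_too_long(data_dictionary))
print(invalid_codes(["F", "M", "f", "NA"], sex_entry))
```

Keeping the picklist and the missing-value marker alongside each variable makes it possible to check entered data against the documentation, as the last helper does.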